Whole Exome genomic analysis of NRG-GY003 ovarian cancer tumors
Test
Author
Affiliation
Kelsey Monson, PhD, MS
Icahn School of Medicine at Mount Sinai
Published
July 31, 2025
Keywords
R, Data Analysis, Data Viz
Intro
Variant Filtering Strategy
I first used the Nextflow Sarek pipeline to align the raw fastq files. Below is the workflow I used:
Note that, at this time, I have only merged the variants from the SNP/indel callers, as ASCAT and the other SV/CNV callers do not have .vcfs as their output.
Here is the detailed variant filtering strategy I used to come up with the MAF file used in this analysis:
First find consensus somatic calls:
Present in >=2 somatic callers (Freebayes, Mutect2, Strelka2)
Include variants annotated with PASS
PASS: indicates the variant passed all quality filters applied by the variant caller
Considered the gold-standard for high-quality variants passing the variant caller’s built-in quality control filters
Further refine PASS variants
Minimum tumor Read Depth = 20: ensures sufficient coverage (i.e. evidence) to confidently call tumor variants from normal and/or distinguish from random sequencing errors
Allele frequency 0.05 < AF < 0.95:
>0.05: also helps avoid random sequencing errors/miss-called variants
<0.95: helps avoid germline contamination
Identify rare pathogenic germline variants:
Present in >=2 germline callers (Freebayes, Haplotypecaller, Strelka2)
Identify those most likely to be pathogenic
Limit to protein-coding variants (drop all intergenic and non-coding variants)
Must be annotated with HIGH impact
Eliminate common variants
“Rare” variants are typically those present in <1% of the population
Because I was worried about missing variants, I initially excluded those with >5% population frequency, but this was too liberal.
I revised the script to set the threshold to 1% and will re-generate the MAF file eventually
For the current analysis, I QC’d the variants and only identified likey oncogenic variants in BRCA1/2, so limited the germline analysis to those genes.
Include all likely oncogenes regardless of population frequency (“BRCA1” “BRCA2” “TP53” “PTEN” “ATM” “CHEK2” “PALB2” “APC” “MUTYH”)
Generated consensus MAF file
The above analysis generated two vcfs, one with the processed somatic variants and one with likely pathogenic germline variants
These variants were annotated as SOMATIC or GERMLINE_PATHOGENIC, respectively in the vcfs
I then concatenated these variants to one MAF file for the entire sample using the Nextflow vcf2maf pipeline.
This consensus MAF is what is used for the downstream analysis presented here.
Loading in the data
Packages Used
# Load packageslibrary(maftools) # For majority of maf file analysislibrary(mclust)library("BSgenome.Hsapiens.UCSC.hg38", quietly =TRUE)library(NMF) # For signature calculationlibrary(pheatmap) # For nice heatmapslibrary(ghibli) # For pretty colors
Load in raw MAF file and the clinical data
laml =read.maf(maf="input/merged_consensus_all_samples_germline.maf", clinicalData ="input/NRG-GY00_laml_annot_2.csv",rmFlags =TRUE# "FLAGS, frequently mutated genes in public exomes" )
-Reading
-Validating
-Removing 20 FLAG genes
-Silent variants: 30668
-Summarizing
-Processing clinical data
-Finished in 3.380s elapsed (3.170s cpu)
Tip
Setting rmFlags = TRUE removes frequently mutated genes in public exomes
Data cleaning: setting levels for factor variables
# RECIST response (CR, PR, SD, PD)laml@clinical.data$response <-factor(laml@clinical.data$response, levels=c("CR","PR","SD","PD"))#table(laml@clinical.data$response)# Detailed response including durable and progressive SDlaml@clinical.data$response2 <-factor(laml@clinical.data$response2, levels=c("CR","PR","SD_CB","SD_NCB","PD"))# table(laml@clinical.data$response2)# Response by treatmentlaml@clinical.data$response_tx <-factor(laml@clinical.data$response_tx, levels=c("Nivo_CB","Combo_CB","Nivo_NCB","Combo_NCB"))#table(laml@clinical.data$response_tx)# Race (White, Non-White, and Not Reported)laml@clinical.data$race2 <-factor(laml@clinical.data$race2, levels=c("W","NW","NR"))#table(laml@clinical.data$race2)# Site (dichotomous)laml@clinical.data$Site2 <-factor(laml@clinical.data$Site2, levels=c("Adnexa","Non_Adnexa"))#table(laml@clinical.data$Site2)# Primary/Metlaml@clinical.data$Primary_Met <-factor(laml@clinical.data$Primary_Met, levels=c("P","M"))#table(laml@clinical.data$Primary_Met)
Subsetting datasets
We have some normal samples with no matched tumor, and we may be interested in doing subtype-specific analysis.
Let’s subset to only tumor samples, and then create further subsetted datasets based on clinical characteristics.
Only tumor samples
I also annotated with pathogenic germline variants; the only relevant ones were in BRCA1 and BRCA2, so we are subsetting to somatic mutations and pathogenic germline variants.
# Subsetting the other dataset to "not serous"`%ni%`<-Negate(`%in%`)laml_other =subsetMaf(maf = laml_tum, clinQuery ="celltype %ni% 'Serous_Adenocarcinoma'")
## Subsetting by treatment to see if there are differences### There shouldn't be any, since treatment assignment was randomized, but is is good to confirm.laml_nivo =subsetMaf(maf = laml_tum, clinQuery ="tx %in% 'Nivo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data
View the MAF file summary
Here is a basic summary of the MAF file
laml_tum
An object of class MAF
ID summary Mean Median
<char> <char> <num> <num>
1: NCBI_Build GRCh38 NA NA
2: Center . NA NA
3: Samples 86 NA NA
4: nGenes 6897 NA NA
5: Frame_Shift_Del 234 2.721 2.0
6: Frame_Shift_Ins 43 0.500 0.0
7: In_Frame_Del 83 0.965 0.0
8: In_Frame_Ins 6 0.070 0.0
9: Missense_Mutation 9193 106.895 81.0
10: Nonsense_Mutation 524 6.093 4.0
11: Nonstop_Mutation 13 0.151 0.0
12: Splice_Site 505 5.872 3.0
13: Translation_Start_Site 19 0.221 0.0
14: total 10620 123.488 92.5
These are some more summaries you can generate
# I'm commenting them out as they have too much info for the tables to be readable.# #Shows sample summary.# getSampleSummary(laml_tum)# #Shows gene summary.# getGeneSummary(laml_tum)# #shows clinical data associated with samples# getClinicalData(laml_tum)# #Shows all fields in MAF# getFields(laml_tum)
Visualization
Summarize the maf file
We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.
#Use mafbarplot for a minimal barplot of mutated genes.mafbarplot(maf = laml_tum)
Summarize the other subsets
Here are the mutation distributions for the other subsets.
As a sanity check, the majority of Serous samples have TP53 mutations, while there are few TP53 mutations in the top 10 for the non-Serous samples (as is expected).
sigpw plots enrichment for known oncogenic signaling pathways as defined in TCGA, Sanchez/Vega et al.
# Plot genes in 2 top oncogenic signaling pathwaysoncoplot(maf = laml_tum, pathways ="sigpw", gene_mar =8, fontSize =0.6, topPathways =2)
# Collapse to just plot the pathways, now top 5oncoplot(maf = laml_tum, pathways ="sigpw", gene_mar =8, fontSize =0.6, topPathways =5, collapsePathway =TRUE)
# Top 10 pathways*# *I've set `topPathways` = 10 but this is 24 pathways, not sure what's going on there...oncoplot(maf = laml_tum, pathways ="smgbp", gene_mar =8, fontSize =0.6, topPathways =10, collapsePathway =TRUE)